Numexpr and the need for speed

In [1]:
import numpy as np
import numexpr as ne
import numpy.random as npr

I've been meaning to try out numexpr for a while. Numexpr is a library that evaluates expressions on NumPy arrays using multiple threads and without allocating full-size temporary arrays for intermediate results. If those claims hold up, I'd like to put in some pull requests to accelerate libraries I like, such as the Chainer library for neural networks.
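
Since the multithreading claim is a big part of the appeal, it's worth knowing that numexpr picks a default thread count automatically and that this can be overridden. A quick check using numexpr's own detect_number_of_cores and set_num_threads (just a sketch; none of the timings below depend on it):

# how many cores numexpr detects on this machine
print(ne.detect_number_of_cores())
# cap or raise the worker count; set_num_threads returns the previous setting
ne.set_num_threads(ne.detect_number_of_cores())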

To test the speed-ups, I'll compare the performance of various functions as implemented in NumPy and Numexpr.

In [2]:
# First, a simple function to subtract a 'gradient' from parameters
def numpy_sgd_update(param, grad):
    param -= grad
    return param
    
def numexpr_sgd_update(param, grad):
    # evaluate the expression string and write the result into `param` in place
    ne.evaluate('param - grad', out=param)
    return param
In [3]:
# Check that the two functions give the same output, so we know they're equivalent
param = npr.randn(3)
grad = npr.randn(3)

print(numpy_sgd_update(param.copy(), grad))
print(numexpr_sgd_update(param.copy(), grad))
[-0.43502512  0.77373611  1.24584445]
[-0.43502512  0.77373611  1.24584445]

Now that we know the functions do essentially the same thing, let's time them on some realistically sized inputs.

In [4]:
# First, use vectors the same size as a 4096*4096 fully-connected layer, as used in VGG-19
np_param = npr.randn(4096*4096).astype(np.float32)
ne_param = np_param.copy()
grad = 0.01 * npr.randn(4096*4096).astype(np.float32)
In [5]:
%%timeit
numpy_sgd_update(np_param, grad)
8.81 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]:
%%timeit
numexpr_sgd_update(ne_param, grad)
7.7 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]:
# Next, a smaller vector approximately the size of a 3x3x3x64 convolution filter
np_param = npr.randn(3*3*3*64).astype(np.float32)
ne_param = np_param.copy()
grad = 0.01 * npr.randn(3*3*3*64).astype(np.float32)
In [8]:
%%timeit
numpy_sgd_update(np_param, grad)
1.08 µs ± 14.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [9]:
%%timeit
numexpr_sgd_update(ne_param, grad)
11.5 µs ± 40.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

On the smaller array, the overhead of numexpr's expression compilation and virtual machine makes evaluation roughly 10x slower. On the larger array, however, numexpr gives a modest speed-up of about 1.1-1.2x. Per iteration, this means the small-array subtraction takes about 10 microseconds longer with numexpr, while the large-array subtraction takes about a millisecond less.
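
Spelling that out from the reported means (back-of-the-envelope arithmetic on the numbers above, not a new measurement):

# per-iteration penalty on the small array, and saving on the large array
small_penalty_us = 11.5 - 1.08   # numexpr is roughly 10 µs slower per small update
large_saving_ms = 8.81 - 7.70    # but roughly 1.1 ms faster per large update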

Next we examine the speed-ups on the Adam update, which involves more array operations per step and so may show a different timing pattern. The version below omits Adam's bias-correction terms and hard-codes the usual hyperparameters (learning rate 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8).

In [10]:
def numpy_adam_update(param, grad, m, v):
    # exponential moving averages of the gradient and of its square
    m += (1 - 0.9) * (grad - m)
    v += (1 - 0.999) * (grad * grad - v)
    param -= 0.001 * m / (np.sqrt(v) + 1e-8)
    return param, m, v

def numexpr_adam_update(param, grad, m, v):
    # casting='same_kind' allows the evaluated results to be written back into
    # the float32 arrays despite the float constants in the expressions
    ne.evaluate('m + (1 - 0.9) * (grad - m)', out=m, casting='same_kind')
    ne.evaluate('v + (1 - 0.999) * (grad * grad - v)', out=v, casting='same_kind')
    ne.evaluate('param - 0.001 * m / (sqrt(v) + 1e-8)', out=param, casting='same_kind')
    return param, m, v
In [11]:
np_param = npr.randn(3).astype(np.float32)
np_m = npr.randn(3).astype(np.float32)
np_v = npr.rand(3).astype(np.float32)

ne_param = np_param.copy()
ne_m = np_m.copy()
ne_v = np_v.copy()

grad = 0.01 * npr.randn(3).astype(np.float32)
In [12]:
print(numpy_adam_update(np_param, grad, np_m, np_v))
print(numexpr_adam_update(ne_param, grad, ne_m, ne_v))
(array([-0.8058309 ,  0.62934005,  1.82563162], dtype=float32), array([-1.02522945, -0.57366568, -0.49004886], dtype=float32), array([ 0.4954676 ,  0.99638748,  0.75871485], dtype=float32))
(array([-0.8058309 ,  0.62934005,  1.82563162], dtype=float32), array([-1.02522945, -0.57366568, -0.49004886], dtype=float32), array([ 0.4954676 ,  0.99638748,  0.75871485], dtype=float32))

The functions return the same result, so let's do some timing.

In [13]:
# First, use vectors the same size as a 4096*4096 fully-connected layer, as used in VGG-19
np_param = npr.randn(4096*4096).astype(np.float32)
np_m = npr.randn(4096*4096).astype(np.float32)
np_v = npr.rand(4096*4096).astype(np.float32)

ne_param = np_param.copy()
ne_m = np_m.copy()
ne_v = np_v.copy()

grad = 0.01 * npr.randn(4096*4096).astype(np.float32)
In [14]:
%%timeit
numpy_adam_update(np_param, grad, np_m, np_v)
99.8 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]:
%%timeit
numexpr_adam_update(ne_param, grad, ne_m, ne_v)
50.1 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]:
# Next, use vectors the same size as a 3x3x3x64 convolution filter
np_param = npr.randn(3*3*3*64).astype(np.float32)
np_m = npr.randn(3*3*3*64).astype(np.float32)
np_v = npr.rand(3*3*3*64).astype(np.float32)

ne_param = np_param.copy()
ne_m = np_m.copy()
ne_v = np_v.copy()

grad = 0.01 * npr.randn(3*3*3*64).astype(np.float32)
In [17]:
%%timeit
numpy_adam_update(np_param, grad, np_m, np_v)
14.5 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [18]:
%%timeit
numexpr_adam_update(ne_param, grad, ne_m, ne_v)
58.6 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The pattern here is similar, but more pronounced. On the large arrays numexpr is faster by about a factor of 2, likely due to the reduction in temporary allocations. On the small arrays it is still slower, but the gap is a factor of about 4 rather than 10, since the NumPy baseline itself now does much more work per call.
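
To make the temporaries point concrete, here is a rough sketch of what the first line of the NumPy update allocates compared with the corresponding numexpr call (an illustration of the likely cause, not something measured here):

# each sub-expression in the NumPy version materialises a full-size array
tmp1 = grad - np_m            # temporary the same size as the parameters
tmp2 = (1 - 0.9) * tmp1       # a second temporary
np_m += tmp2                  # only now is the result written in place
# ne.evaluate('m + (1 - 0.9) * (grad - m)', out=m) instead streams the expression
# through numexpr's virtual machine in cache-sized blocks and writes straight
# into `out`, so no full-size temporaries are needed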

We can also look at activation functions computed during forward passes, such as the ReLU here. Numexpr doesn't have an elementwise max function, but the 'where' idiom gives the same result and still shows a sizeable performance gain on a large array. Since mini-batches of activations are usually much larger arrays than the convolution filters themselves, this kind of speed-up could be valuable for inference speed.

In [27]:
# Make sure these things give the same result
a = np.array([0.5, -0.1, -1.0, 5.0], dtype=np.float32)
print(np.fmax(a, 0))
print(ne.evaluate('where(a>0,a,0)'))
[ 0.5  0.   0.   5. ]
[ 0.5  0.   0.   5. ]
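
As an aside, the same where idiom covers other piecewise activations; here is a minimal sketch for a leaky ReLU (the 0.01 slope is just an illustrative choice, not anything used in the timings below):

alpha = 0.01  # illustrative negative slope
print(np.where(a > 0, a, alpha * a))
print(ne.evaluate('where(a > 0, a, alpha * a)'))
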
In [28]:
np_parm = npr.randn(4096*4096).astype(np.float32)
ne_parm = np_parm.copy()
In [29]:
%%timeit
np.fmax(np_parm, 0)
87.9 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [30]:
%%timeit
ne.evaluate('where(ne_parm>0, ne_parm, 0)')
20.7 ms ± 86.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Given that plain NumPy is fastest on small arrays but numexpr gives big improvements on large ones, we can try combining the two: dispatch on array size so that small arrays keep NumPy's low overhead and large arrays get numexpr's speed-up.

In [51]:
def opt_eval(np_func, ne_func, *args):
    """
    Crude dispatcher: use the NumPy implementation when the first argument
    is small, otherwise use the numexpr implementation.
    """
    if np.prod(args[0].shape) < 2**10:
        return np_func(*args)
    else:
        return ne_func(*args)
In [56]:
# For this test, compute tanh of a ReLU on the array
def np_func(data):
    return np.tanh(np.fmax(data, 0))

def ne_func(data):
    return ne.evaluate('tanh(where(data>0, data, 0))')

#smallest array
data0 = npr.randn(2**6).astype(np.float32)

#small array
data1 = npr.randn(2**12).astype(np.float32)

#medium array
data2 = npr.randn(2**18).astype(np.float32)

#large array
data3 = npr.randn(2**24).astype(np.float32)

On the smallest array (timed below), the overhead of computing the size and evaluating the if-statement appears to be about 6 microseconds per call, which is comparable to numexpr's own evaluation overhead.

In [57]:
%%timeit
np_func(data0)
2.12 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [58]:
%%timeit
ne_func(data0)
10.4 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [59]:
%%timeit
opt_eval(np_func, ne_func, data0)
8.57 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

On the next array size up, the if-statement and size calculation make opt_eval somewhat slower than calling numexpr directly, but it is still faster than pure NumPy.

In [61]:
%%timeit
np_func(data1)
82.5 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [62]:
%%timeit
ne_func(data1)
42.9 µs ± 674 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [63]:
%%timeit
opt_eval(np_func, ne_func, data1)
59.2 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

On the medium-sized array, the dispatch overhead becomes negligible and the speed-up over NumPy is more than 10x. An array of this size is a plausible mini-batch of activations in a neural network, which suggests worthwhile speed-ups in practice.

In [64]:
%%timeit
np_func(data2)
5.39 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [65]:
%%timeit
ne_func(data2)
518 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]:
%%timeit
opt_eval(np_func, ne_func, data2)
501 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The speed-up keeps growing with array size: on the largest array numexpr is more than 12x faster than plain NumPy, and the dispatch overhead disappears into the noise.

In [67]:
%%timeit
np_func(data3)
367 ms ± 6.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [68]:
%%timeit
ne_func(data3)
28.8 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [69]:
%%timeit
opt_eval(np_func, ne_func, data3)
29.1 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusions:

My main observation is a confirmation of what the numexpr authors say: numexpr can lead to large speed-ups on big arrays, and the speed-up grows with array size. On very small arrays, numexpr's evaluation overhead wipes out those gains, but that overhead is only on the order of a few microseconds, comparable to the size check and if-statement in opt_eval, so a simple size-based dispatch keeps the best of both worlds.
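
If I do end up sending pull requests to something like Chainer, the dispatch in opt_eval could be packaged a bit more cleanly. Here is a minimal sketch of that idea (it assumes, as opt_eval does, that the size of the first positional argument is a fair proxy for the amount of work; size_dispatch and relu_tanh are names made up for illustration, not anything from Chainer or numexpr):

def size_dispatch(ne_impl, threshold=2**10):
    """Route to the numexpr implementation for large inputs, keep plain NumPy for small ones."""
    def decorator(np_impl):
        def wrapper(*args):
            if args[0].size < threshold:
                return np_impl(*args)
            return ne_impl(*args)
        return wrapper
    return decorator

# usage: register the numexpr version once; callers keep calling the
# NumPy-flavoured function exactly as before
@size_dispatch(ne_func)
def relu_tanh(data):
    return np.tanh(np.fmax(data, 0))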